REEF: Resolving Length Bias in Frequent Sequence Mining Using Sampling

Authors

  • Ariella Richardson
  • Gal A. Kaminka
  • Sarit Kraus
Abstract

Classic support-based approaches efficiently address frequent sequence mining. However, support-based mining has been shown to suffer from a bias towards short sequences. In this paper, we propose a method to resolve this bias when mining the most frequent sequences. To resolve the length bias we define norm-frequency, based on the statistical z-score of support, and use it to replace support-based frequency. Our approach mines the subsequences that are frequent relative to other subsequences of the same length. Unfortunately, naive use of norm-frequency hinders mining scalability: norm-frequency breaks the anti-monotonic property of support, which is essential for pruning large sets of candidate sequences. We describe a bound that enables pruning and thereby restores scalability. Calculating the bound requires a preprocessing stage on a sample of the dataset. Sampling the data distorts the sample's measures, and we present a method to correct this distortion. We conducted experiments on four data sets: synthetic data, textual data, remote control zapping data and computer user input data. The experimental results establish that we successfully overcome the short sequence bias, and illustrate that our mining algorithm produces meaningful sequences.
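The idea of norm-frequency described in the abstract can be illustrated with a minimal sketch. This is not the REEF algorithm itself: it omits the pruning bound and the sampling correction, and it assumes for simplicity that subsequences are contiguous and that support is the number of database sequences containing a subsequence. The function names `support_counts` and `norm_frequency` and the toy database are hypothetical, introduced only for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def support_counts(database, max_len):
    """Count how many sequences contain each contiguous subsequence up to max_len."""
    counts = defaultdict(int)
    for seq in database:
        seen = set()  # count each subsequence at most once per sequence
        for n in range(1, max_len + 1):
            for i in range(len(seq) - n + 1):
                seen.add(tuple(seq[i:i + n]))
        for sub in seen:
            counts[sub] += 1
    return dict(counts)


def norm_frequency(counts):
    """Z-score each support against the supports of subsequences of the same length."""
    by_len = defaultdict(list)
    for sub, c in counts.items():
        by_len[len(sub)].append(c)
    # Mean and (population) standard deviation of support, per length.
    stats = {n: (mean(v), pstdev(v)) for n, v in by_len.items()}
    scores = {}
    for sub, c in counts.items():
        mu, sigma = stats[len(sub)]
        # Assumption: a length whose supports are all equal gets score 0.
        scores[sub] = (c - mu) / sigma if sigma > 0 else 0.0
    return scores


# Toy database (assumed data, not from the paper).
database = [list("abcab"), list("abd"), list("cab"), list("abe")]
counts = support_counts(database, max_len=2)
scores = norm_frequency(counts)

# Rank subsequences by norm-frequency instead of raw support.
for sub, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print("".join(sub), counts[sub], round(score, 3))
```

In this toy data the subsequences `a` and `ab` have the same support, yet `ab` receives the higher norm-frequency because it stands out more among subsequences of length two. It also shows why pruning becomes harder: a longer sequence can outscore its own prefix, so the score is not anti-monotonic the way support is.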

Similar resources

REEF: Resolving Length Bias in Frequent Sequence Mining

Classic support-based approaches efficiently address frequent sequence mining. However, support-based mining has been shown to suffer from a bias towards short sequences. In this paper, we propose a method to resolve this bias when mining the most frequent sequences. To resolve the length bias we define norm-frequency, based on the statistical z-score of support, and use it to replace s...

High Fuzzy Utility Based Frequent Patterns Mining Approach for Mobile Web Services Sequences

High fuzzy utility based pattern mining is an emerging topic in data mining. It refers to discovering all patterns whose utility meets a user-specified minimum high utility threshold. It comprises extracting patterns that are highly accessed in mobile web service sequences. Different from the traditional fuzzy approach, high fuzzy utility mining considers not only counts of mob...

Sampling-standardized expansion and collapse of reef building in the Phanerozoic

Geologists have long suggested that the Phanerozoic record of marine reefs fluctuated considerably through time but exactly how strong these fluctuations were is still a matter of debate. The published graphs of several authors suggest moderate fluctuations in reef abundance over the Phanerozoic (James 1983; Copper 1988; James & Bourque 1992; Hallock 1997), while other authors have depicted sub...

Mining Frequent Partite Episodes with Partwise Constraints

In this paper, we study the problem of efficiently mining frequent partite episodes that satisfy partwise constraints from an input event sequence. Through our constraints, we can extract episodes related to events and their precedent-subsequent relations, on which we focus, in a short time. This improves the efficiency of data mining using trial and error processes. A partite episode of length...

Finding Frequent Patterns in the Holy Quran Using Text Mining Methods

The text of the Quran differs from other texts in its exceptional concepts, ideas and subjects. Recognizing valuable implicit patterns in vast amounts of data has lately captured the attention of many researchers. Text mining provides the means to extract information from texts and can help us reach our objective in this regard. In recent years, Text Mining on Quran and e...

Publication date: 2014